import re
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from scipy.cluster.vq import kmeans, vq, whiten
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import word_tokenize
import matplotlib.pyplot as plt
import matplotlib.image as img
import seaborn as sns
%matplotlib inline
"ggplot") plt.style.use(
Overview
You have probably come across Google News, which automatically groups similar news articles under a topic. Have you ever wondered what process runs in the background to arrive at these groups? We will be exploring unsupervised learning through clustering using the SciPy library in Python. We will cover pre-processing of data and application of hierarchical and k-means clustering. We will explore player statistics from a popular football video game, FIFA 18. We will be able to quickly apply various clustering algorithms on data, visualize the clusters formed and analyze results.
Introduction to Clustering
Before we are ready to classify news articles, we need the basics of clustering. We will familiarize ourselves with unsupervised learning, a class of machine learning algorithms, and with clustering, one of its most popular techniques. We will explore two popular clustering techniques - hierarchical clustering and k-means clustering - and conclude with basic pre-processing steps to apply before clustering data.
Unsupervised learning: basics
Everyday example: Google news
- How does Google News classify articles?
- Unsupervised Learning Algorithm: Clustering
- Match frequent terms in articles to find similarity
What is unsupervised learning?
- A group of machine learning algorithms that find patterns in data
- Data for algorithms has not been labeled, classified or characterized
- The objective of the algorithm is to interpret any structure in the data
- Common unsupervised learning algorithms: clustering, neural networks, anomaly detection
What is clustering?
- The process of grouping items with similar characteristics
- Items in a group are more similar to each other than to items in other groups
- Example: distance between points on a 2D plane
Plotting data for clustering - Pokemon sightings
x_coordinates = [80, 93, 86, 98, 86, 9, 15, 3, 10, 20, 44, 56, 49, 62, 44]
y_coordinates = [87, 96, 95, 92, 92, 57, 49, 47, 59, 55, 25, 2, 10, 24, 10]
_ = sns.scatterplot(x=x_coordinates, y=y_coordinates)
plt.show()
Visualizing helps in determining how many clusters are in the data.
Unsupervised learning in real world
Example: segmentation of learners at DataCamp based on the courses they complete. Since the training data has no labels, an unsupervised algorithm must be used to find patterns in the data.
Pokémon sightings
There have been reports of sightings of rare, legendary Pokémon, and we have been asked to investigate! We will plot the coordinates of the sightings to find out where the Pokémon might be. The X and Y coordinates of the points are stored in the lists x_p and y_p, respectively.
x_p = [9, 6, 2, 3, 1, 7, 1, 6, 1, 7, 23, 26, 25, 23, 21, 23, 23, 20, 30, 23]
y_p = [8, 4, 10, 6, 0, 4, 10, 10, 6, 1, 29, 25, 30, 29, 29, 30, 25, 27, 26, 30]
_ = sns.scatterplot(x=x_p, y=y_p)
plt.show()
Notice the areas where the sightings are dense. This indicates that there is not one, but two legendary Pokémon out there!
Basics of cluster analysis
What is a cluster?
- A group of items with similar characteristics
- Google News: articles where similar words and word associations appear together
- Customer Segments
Clustering algorithms
- Hierarchical clustering
- K-means clustering
- Other clustering algorithms: DBSCAN, Gaussian Methods
Hierarchical clustering in SciPy
x_coordinates = [80.1, 93.1, 86.6, 98.5, 86.4, 9.5, 15.2, 3.4, 10.4, 20.3, 44.2, 56.8, 49.2, 62.5, 44.0]
y_coordinates = [87.2, 96.1, 95.6, 92.4, 92.4, 57.7, 49.4, 47.3, 59.1, 55.5, 25.6, 2.1, 10.9, 24.1, 10.3]
df_c = pd.DataFrame({'x_cood': x_coordinates, 'y_cood': y_coordinates})
df_c.head()
  | x_cood | y_cood |
---|---|---|
0 | 80.1 | 87.2 |
1 | 93.1 | 96.1 |
2 | 86.6 | 95.6 |
3 | 98.5 | 92.4 |
4 | 86.4 | 92.4 |
= linkage(df_c, method="ward")
Z_c 'cluster_labels'] = fcluster(Z_c, 3, criterion="maxclust")
df_c[= sns.scatterplot(data=df_c, x="x_cood", y="y_cood", hue="cluster_labels", palette="RdGy")
_ plt.show()
K-means clustering in SciPy
df_c = pd.DataFrame({'x_cood': x_coordinates, 'y_cood': y_coordinates})
centroids_c, _ = kmeans(df_c, 3)
df_c["cluster_labels"], _ = vq(df_c, centroids_c)
_ = sns.scatterplot(data=df_c, x="x_cood", y="y_cood", hue="cluster_labels", palette="RdGy")
plt.show()
Pokémon sightings: hierarchical clustering
We are going to continue the investigation into the sightings of legendary Pokémon. In the scatter plot we identified two areas where Pokémon sightings were dense. This means that the points seem to separate into two clusters. We will form two clusters of the sightings using hierarchical clustering.
df_p = pd.DataFrame({'x': x_p, 'y': y_p})
df_p.head()
  | x | y |
---|---|---|
0 | 9 | 8 |
1 | 6 | 4 |
2 | 2 | 10 |
3 | 3 | 6 |
4 | 1 | 0 |
'x' and 'y' are the columns of X and Y coordinates of the sighting locations, stored in a pandas DataFrame.
# Use the linkage() function to compute distance
Z_p = linkage(df_p, 'ward')

# Generate cluster labels for each data point with two clusters
df_p['cluster_labels'] = fcluster(Z_p, 2, criterion='maxclust')

# Plot the points with seaborn
sns.scatterplot(x="x", y="y", hue="cluster_labels", data=df_p)
plt.show()
Note that the resulting plot has an extra cluster labelled 0 in the legend.
Pokémon sightings: k-means clustering
We will continue the investigation using the same Pokémon sightings data, this time forming clusters with k-means clustering.
x and y are the columns of X and Y coordinates of the sighting locations, stored in a pandas DataFrame.
df_p.dtypes
x int64
y int64
cluster_labels int32
dtype: object
df_p = df_p.apply(lambda x: x.astype("float"))
# Compute cluster centers (using only the x and y columns)
centroids_p, _ = kmeans(df_p[['x', 'y']], 2)

# Assign cluster labels to each data point
df_p['cluster_labels'], _ = vq(df_p[['x', 'y']], centroids_p)

# Plot the points with seaborn
sns.scatterplot(x="x", y="y", hue="cluster_labels", data=df_p)
plt.show()
Data preparation for cluster analysis
Why do we need to prepare data for clustering?
- Variables have incomparable units (product dimensions in cm, price in $)
- Variables with same units have vastly different scales and variances (expenditures on cereals, travel)
- Data in raw form may lead to bias in clustering
- Clusters may be heavily dependent on one variable
- Solution: normalization of individual variables
Normalization of data
- Normalization: process of rescaling data to a standard deviation of 1
x_new = x / std_dev(x)
data = [5, 1, 3, 3, 2, 3, 3, 8, 1, 2, 2, 3, 5]
scaled_data = whiten(data)
scaled_data
array([2.72733941, 0.54546788, 1.63640365, 1.63640365, 1.09093577,
1.63640365, 1.63640365, 4.36374306, 0.54546788, 1.09093577,
1.09093577, 1.63640365, 2.72733941])
Illustration: normalization of data
_ = sns.lineplot(x=range(len(data)), y=data, label="original")
_ = sns.lineplot(x=range(len(data)), y=scaled_data, label='scaled')
plt.show()
Normalize basic list data
Let us try to normalize some data. goals_for is a list of goals scored by a football team in their last ten matches. Let us standardize the data using the whiten() function.
goals_for = [4, 3, 2, 3, 1, 1, 2, 0, 1, 4]

# Use the whiten() function to standardize the data
scaled_goals_for = whiten(goals_for)
scaled_goals_for
array([3.07692308, 2.30769231, 1.53846154, 2.30769231, 0.76923077,
0.76923077, 1.53846154, 0. , 0.76923077, 3.07692308])
The scaled values have less variation in them.
Visualize normalized data
After normalizing the data, we can compare the scaled data to the original data to see the difference.
_ = sns.lineplot(x=range(len(goals_for)), y=goals_for, label="original")
_ = sns.lineplot(x=range(len(goals_for)), y=scaled_goals_for, label="scaled")
plt.show()
The scaled values show less variation than the original values.
Normalization of small numbers
# Prepare data
rate_cuts = [0.0025, 0.001, -0.0005, -0.001, -0.0005, 0.0025, -0.001, -0.0015, -0.001, 0.0005]

# Use the whiten() function to standardize the data
scaled_rate_cuts = whiten(rate_cuts)

# Plot original data
plt.plot(rate_cuts, label='original')

# Plot scaled data
plt.plot(scaled_rate_cuts, label='scaled')

plt.legend()
plt.show()
The original values are negligible compared to the scaled values.
FIFA 18: Normalize data
FIFA 18 is a football video game that was released in 2017 for PC and consoles. The dataset that we are about to work on contains data on the 1000 top individual players in the game. We will explore various features of the data as we move ahead.
= pd.read_csv("datasets/fifa.csv")
fifa fifa.head()
  | ID | name | full_name | club | club_logo | special | age | league | birth_date | height_cm | ... | prefers_cb | prefers_lb | prefers_lwb | prefers_ls | prefers_lf | prefers_lam | prefers_lcm | prefers_ldm | prefers_lcb | prefers_gk |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 20801 | Cristiano Ronaldo | C. Ronaldo dos Santos Aveiro | Real Madrid CF | https://cdn.sofifa.org/18/teams/243.png | 2228 | 32 | Spanish Primera División | 1985-02-05 | 185.0 | ... | False | False | False | False | False | False | False | False | False | False |
1 | 158023 | L. Messi | Lionel Messi | FC Barcelona | https://cdn.sofifa.org/18/teams/241.png | 2158 | 30 | Spanish Primera División | 1987-06-24 | 170.0 | ... | False | False | False | False | False | False | False | False | False | False |
2 | 190871 | Neymar | Neymar da Silva Santos Jr. | Paris Saint-Germain | https://cdn.sofifa.org/18/teams/73.png | 2100 | 25 | French Ligue 1 | 1992-02-05 | 175.0 | ... | False | False | False | False | False | False | False | False | False | False |
3 | 176580 | L. Suárez | Luis Suárez | FC Barcelona | https://cdn.sofifa.org/18/teams/241.png | 2291 | 30 | Spanish Primera División | 1987-01-24 | 182.0 | ... | False | False | False | False | False | False | False | False | False | False |
4 | 167495 | M. Neuer | Manuel Neuer | FC Bayern Munich | https://cdn.sofifa.org/18/teams/21.png | 1493 | 31 | German Bundesliga | 1986-03-27 | 193.0 | ... | False | False | False | False | False | False | False | False | False | True |
5 rows × 185 columns
We will work with two columns: eur_wage, the wage of a player in euros, and eur_value, their current transfer market value.
# Scale wage and value
fifa['scaled_wage'] = whiten(fifa['eur_wage'])
fifa['scaled_value'] = whiten(fifa['eur_value'])

# Plot the two columns in a scatter plot
fifa.plot(x="scaled_wage", y="scaled_value", kind='scatter')
plt.show()
# Check mean and standard deviation of scaled values
fifa[['scaled_wage', 'scaled_value']].describe()
  | scaled_wage | scaled_value |
---|---|---|
count | 1000.000000 | 1000.000000 |
mean | 1.119812 | 1.306272 |
std | 1.000500 | 1.000500 |
min | 0.000000 | 0.000000 |
25% | 0.467717 | 0.730412 |
50% | 0.854794 | 1.022576 |
75% | 1.407184 | 1.542995 |
max | 9.112425 | 8.984064 |
The scaled values have a standard deviation of (approximately) 1.
Hierarchical Clustering
We will focus on a popular clustering algorithm - hierarchical clustering - and its implementation in SciPy. In addition to the procedure for performing hierarchical clustering, this chapter attempts to answer an important question: how many clusters are present in your data? We will conclude with a discussion of the limitations of hierarchical clustering and considerations while using it.
Basics of hierarchical clustering
Creating a distance matrix using linkage
scipy.cluster.hierarchy.linkage(observations,
                                method='single',
                                metric='euclidean',
                                optimal_ordering=False)
- method: how to calculate the proximity of clusters
- metric: distance metric
- optimal_ordering: order data points
Which method should we use?
- single: based on two closest objects
- complete: based on two farthest objects
- average: based on the arithmetic mean of all objects
- centroid: based on the geometric mean of all objects
- median: based on the median of all objects
- ward: based on the sum of squares
Create cluster labels with fcluster
scipy.cluster.hierarchy.fcluster(distance_matrix,
                                 num_clusters,
                                 criterion)
- distance_matrix: output of the linkage() method
- num_clusters: number of clusters
- criterion: how to decide thresholds to form clusters
Final thoughts on selecting a method
- No one right method for all
- Need to carefully understand the distribution of data
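To see how the choice of method plays out, here is a minimal sketch (not part of the course material) that clusters the same toy points with three different linkage methods; on well-separated data like this the partitions usually agree (the numeric labels may be swapped), while on noisier data the methods can disagree.

import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two well-separated groups of three points each
toy = pd.DataFrame({'x': [1.0, 1.2, 0.8, 5.0, 5.2, 4.8],
                    'y': [1.0, 0.9, 1.1, 5.0, 5.1, 4.9]})

for method in ['single', 'complete', 'ward']:
    Z = linkage(toy, method=method, metric='euclidean')
    labels = fcluster(Z, 2, criterion='maxclust')
    print(method, labels)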
Hierarchical clustering: ward method
It is time for Comic-Con! Comic-Con is an annual comic-based convention held in major cities in the world. We have the data of last year’s footfall, the number of people at the convention ground at a given time. We would like to decide the location of the stall to maximize sales. Using the ward method, we’ll apply hierarchical clustering to find the two points of attraction in the area.
= pd.read_csv("datasets/comic_con.csv")
comic_con comic_con.head()
  | x_coordinate | y_coordinate | x_scaled | y_scaled |
---|---|---|---|---|
0 | 17 | 4 | 0.509349 | 0.090010 |
1 | 20 | 6 | 0.599234 | 0.135015 |
2 | 35 | 0 | 1.048660 | 0.000000 |
3 | 14 | 0 | 0.419464 | 0.000000 |
4 | 37 | 4 | 1.108583 | 0.090010 |
# Use the linkage() function
distance_matrix_cc = linkage(comic_con[['x_scaled', 'y_scaled']], method="ward", metric='euclidean')

# Assign cluster labels
comic_con['cluster_labels'] = fcluster(distance_matrix_cc, 2, criterion='maxclust')

# Plot clusters
sns.scatterplot(x='x_scaled', y='y_scaled',
                hue='cluster_labels', data=comic_con)
plt.show()
The two clusters correspond to the points of attraction in the figure: towards the bottom (a stage) and the top right (an interesting stall).
Hierarchical clustering: single method
Let us use the same footfall dataset and check if any changes are seen if we use a different method for clustering.
# Use the linkage() function
distance_matrix_cc = linkage(comic_con[['x_scaled', 'y_scaled']], method="single", metric="euclidean")

# Assign cluster labels
comic_con['cluster_labels'] = fcluster(distance_matrix_cc, 2, criterion="maxclust")

# Plot clusters
sns.scatterplot(x='x_scaled', y='y_scaled',
                hue='cluster_labels', data=comic_con)
plt.show()
The clusters formed are not different from the ones created using the ward method.
Hierarchical clustering: complete method
For the third and final time, let us use the same footfall dataset and check if any changes are seen if we use a different method for clustering.
# Use the linkage() function
distance_matrix_cc = linkage(comic_con[['x_scaled', 'y_scaled']], method="complete")

# Assign cluster labels
comic_con['cluster_labels'] = fcluster(distance_matrix_cc, 2, criterion="maxclust")

# Plot clusters
sns.scatterplot(x='x_scaled', y='y_scaled',
                hue='cluster_labels', data=comic_con)
plt.show()
Coincidentally, the clusters formed are not different from the ward or single methods.
Visualize clusters
Why visualize clusters?
- Try to make sense of the clusters formed
- An additional step in validation of clusters
- Spot trends in data
An introduction to seaborn
- seaborn: a Python data visualization library based on matplotlib
- Has better, easily modifiable aesthetics than matplotlib!
- Contains functions that make data visualization tasks easy in the context of data analytics
- Use case for clustering: hue parameter for plots
Visualize clusters with matplotlib
# Plot a scatter plot
comic_con.plot.scatter(x='x_scaled',
                       y='y_scaled',
                       c='cluster_labels')
plt.show()
Visualize clusters with seaborn
# Plot a scatter plot using seaborn
sns.scatterplot(x='x_scaled',
                y='y_scaled',
                hue='cluster_labels',
                data=comic_con)
plt.show()
How many clusters?
Introduction to dendrograms
- Strategy till now - decide clusters on visual inspection
- Dendrograms help in showing progressions as clusters are merged
- A dendrogram is a branching diagram that demonstrates how each cluster is composed by branching out into its child nodes
Create a dendrogram
Dendrograms are branching diagrams that show the merging of clusters as we move through the distance matrix. Let us use the Comic Con footfall data to create a dendrogram.
# Create a dendrogram
dn_cc = dendrogram(distance_matrix_cc)

# Display the dendrogram
plt.show()
The top two clusters are farthest away from each other.
Limitations of hierarchical clustering
Measuring speed in hierarchical clustering
- timeit module
- Measure the speed of the .linkage() method
- Use randomly generated points
- Run various iterations to extrapolate
points_s = 100
df_s = pd.DataFrame({'x': np.random.sample(points_s),
                     'y': np.random.sample(points_s)})
%timeit linkage(df_s[['x', 'y']], method='ward', metric='euclidean')
2 ms ± 312 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Comparison of runtime of linkage method
- Increasing runtime with data points
- Quadratic increase of runtime
- Not feasible for large datasets
%timeit linkage(comic_con[['x_scaled', 'y_scaled']], method="complete", metric='euclidean')
1.47 ms ± 70 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
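To make the growth in runtime concrete, here is a rough sketch (not from the course; exact timings vary by machine) that times linkage() on progressively larger sets of random points:

import time
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage

for n in [500, 1000, 2000, 4000]:
    df_n = pd.DataFrame({'x': np.random.sample(n), 'y': np.random.sample(n)})
    start = time.perf_counter()
    linkage(df_n, method='ward', metric='euclidean')
    print(n, "points:", round(time.perf_counter() - start, 3), "seconds")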
FIFA 18: exploring defenders
In the FIFA 18 dataset, various attributes of players are present. Two such attributes are:
- sliding tackle: a number between 0-99 which signifies how accurately a player is able to perform sliding tackles
- aggression: a number between 0-99 which signifies the commitment and will of a player
These are typically high in defense-minded players. We will perform clustering based on these attributes in the data.
fifa[['sliding_tackle', 'aggression']].head()
  | sliding_tackle | aggression |
---|---|---|
0 | 23 | 63 |
1 | 26 | 48 |
2 | 33 | 56 |
3 | 38 | 78 |
4 | 11 | 29 |
fifa['scaled_sliding_tackle'] = whiten(fifa.sliding_tackle)
fifa['scaled_aggression'] = whiten(fifa.aggression)
# Fit the data into a hierarchical clustering algorithm
distance_matrix_f = linkage(fifa[['scaled_sliding_tackle', 'scaled_aggression']], 'ward')

# Assign cluster labels to each row of data
fifa['cluster_labels'] = fcluster(distance_matrix_f, 3, criterion='maxclust')

# Display cluster centers of each cluster
fifa[['scaled_sliding_tackle', 'scaled_aggression', 'cluster_labels']].groupby('cluster_labels').mean()
cluster_labels | scaled_sliding_tackle | scaled_aggression |
---|---|---|
1 | 2.837810 | 4.280968 |
2 | 0.579966 | 1.766698 |
3 | 1.166930 | 3.415214 |
# Create a scatter plot through seaborn
="scaled_sliding_tackle", y="scaled_aggression", hue="cluster_labels", data=fifa)
sns.scatterplot(x plt.show()
K-Means Clustering
We will explore a different clustering algorithm - k-means clustering - and its implementation in SciPy. K-means clustering overcomes the biggest drawback of hierarchical clustering: runtime. As dendrograms are specific to hierarchical clustering, we will discuss one method to find the number of clusters before running k-means clustering. We will conclude with a discussion of the limitations of k-means clustering and considerations while using this algorithm.
Basics of k-means clustering
Why k-means clustering?
- A critical drawback of hierarchical clustering: runtime
- K-means runs significantly faster on large datasets
Step 1: Generate cluster centers
kmeans(obs, k_or_guess, iter, thresh, check_finite)
- obs: standardized observations
- k_or_guess: number of clusters
- iter: number of iterations (default: 20)
- thresh: threshold (default: 1e-05)
- check_finite: whether to check if observations contain only finite numbers (default: True)
- Returns two objects: cluster centers, distortion
Step 2: Generate cluster labels
vq(obs, code_book, check_finite=True)
- obs: standardized observations
- code_book: cluster centers
- check_finite: whether to check if observations contain only finite numbers (default: True)
- Returns two objects: a list of cluster labels, a list of distortions
A note on distortions
- kmeans returns a single value of distortion
- vq returns a list of distortions
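As a quick check (a sketch, not taken from the course), the single distortion returned by kmeans should essentially equal the mean of the per-observation distortions returned by vq:

import numpy as np
from scipy.cluster.vq import kmeans, vq, whiten

obs = whiten(np.random.rand(100, 2))      # standardized observations
centers, distortion = kmeans(obs, 3)      # single distortion value
labels, distortions = vq(obs, centers)    # one distortion per observation
print(distortion, distortions.mean())     # the two values should be (almost) identical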
Running k-means
K-means clustering
Let us use the Comic Con dataset and check how k-means clustering works on it.
The two steps of k-means clustering:
- Define cluster centers through the kmeans() function. It has two required arguments: observations and number of clusters.
- Assign cluster labels through the vq() function. It has two required arguments: observations and cluster centers.
# Generate cluster centers
cluster_centers_cc, distortion_cc = kmeans(comic_con[['x_scaled', 'y_scaled']], 2)

# Assign cluster labels
comic_con['cluster_labels'], distortion_list_cc = vq(comic_con[['x_scaled', 'y_scaled']], cluster_centers_cc)

# Plot clusters
sns.scatterplot(x='x_scaled', y='y_scaled',
                hue='cluster_labels', data=comic_con)
plt.show()
Runtime of k-means clustering
%timeit kmeans(fifa[['scaled_sliding_tackle', 'scaled_aggression']], 3)
31.9 ms ± 3.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
It took about 5 seconds to run hierarchical clustering on this data, but only about 50 milliseconds to run k-means clustering.
How many clusters?
How to find the right k?
- No absolute method to find right number of clusters (k) in k-means clustering
- Elbow method
Distortions revisited
- Distortion: sum of squared distances of points from cluster centers
- Decreases with an increasing number of clusters
- Becomes zero when the number of clusters equals the number of points
- Elbow plot: line plot of the number of clusters versus distortion
Elbow method
- Elbow plot: plot of the number of clusters and distortion
- Elbow plot helps indicate number of clusters present in data
Final thoughts on using the elbow method
- Only gives an indication of the optimal k (number of clusters)
- Does not always pinpoint the optimal k
- Other methods: average silhouette and gap statistic
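The course does not cover these alternatives, but as a rough sketch, an average silhouette score (computed here with scikit-learn, an extra dependency not used elsewhere in these notes) can be calculated for a range of k and the highest-scoring k preferred:

from scipy.cluster.vq import kmeans, vq
from sklearn.metrics import silhouette_score

# Higher average silhouette indicates better-separated clusters
for k in range(2, 7):
    centers, _ = kmeans(comic_con[['x_scaled', 'y_scaled']], k)
    labels, _ = vq(comic_con[['x_scaled', 'y_scaled']], centers)
    print(k, round(silhouette_score(comic_con[['x_scaled', 'y_scaled']], labels), 3))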
Elbow method on distinct clusters
Let us use the comic con data set to see how the elbow plot looks on a data set with distinct, well-defined clusters.
distortions_cc = []
num_clusters_cc = range(1, 7)

# Create a list of distortions from the kmeans function
for i in num_clusters_cc:
    cluster_centers_cc, distortion_cc = kmeans(comic_con[['x_scaled', 'y_scaled']], i)
    distortions_cc.append(distortion_cc)

# Create a data frame with two lists - num_clusters, distortions
elbow_plot_cc = pd.DataFrame({'num_clusters': num_clusters_cc, 'distortions': distortions_cc})

# Create a line plot of num_clusters and distortions
sns.lineplot(x="num_clusters", y="distortions", data=elbow_plot_cc)
plt.xticks(num_clusters_cc)
plt.show()
= pd.read_csv("datasets/uniform_data.csv")
uniform_data = []
distortions_u = range(2, 7)
num_clusters_u
# Create a list of distortions from the kmeans function
for i in num_clusters_u:
= kmeans(uniform_data[['x_scaled', 'y_scaled']], i)
cluster_centers_u, distortion_u
distortions_u.append(distortion_u)
# Create a data frame with two lists - number of clusters and distortions
= pd.DataFrame({'num_clusters': num_clusters_u, 'distortions': distortions_u})
elbow_plot_u
# Creat a line plot of num_clusters and distortions
="num_clusters", y="distortions", data=elbow_plot_u)
sns.lineplot(x
plt.xticks(num_clusters_u) plt.show()
There is no well defined elbow in this plot!
Limitations of k-means clustering
- How to find the right K (number of clusters)?
- Impact of seeds
- Biased towards equal sized clusters
Final thoughts
- Each technique has its pros and cons
- Consider your data size and patterns before deciding on algorithm
- Clustering is an exploratory phase of analysis
Impact of seeds on distinct clusters
Let us check whether seeds impact the clusters in the Comic Con data, where the clusters are well-defined.
# Initialize seed
np.random.seed(0)

# Run kmeans clustering
cluster_centers_cc, distortion_cc = kmeans(comic_con[['x_scaled', 'y_scaled']], 2)
comic_con['cluster_labels'], distortion_list_cc = vq(comic_con[['x_scaled', 'y_scaled']], cluster_centers_cc)

# Plot the scatterplot
sns.scatterplot(x='x_scaled', y='y_scaled',
                hue='cluster_labels', data=comic_con)
plt.show()
# Initialize seed
np.random.seed([1, 2, 1000])

# Run kmeans clustering
cluster_centers_cc, distortion_cc = kmeans(comic_con[['x_scaled', 'y_scaled']], 2)
comic_con['cluster_labels'], distortion_list_cc = vq(comic_con[['x_scaled', 'y_scaled']], cluster_centers_cc)

# Plot the scatterplot
sns.scatterplot(x='x_scaled', y='y_scaled',
                hue='cluster_labels', data=comic_con)
plt.show()
Uniform clustering patterns
Let us look at the bias in k-means clustering towards the formation of uniform clusters. We will use a mouse-like dataset for our next exercise. A mouse-like dataset is a group of points that resemble the head of a mouse: it has three clusters of points arranged in circles, one each for the face and two ears of a mouse.
= pd.read_csv("datasets/mouse.csv")
mouse # Generate cluster centers
= kmeans(mouse[['x_scaled', 'y_scaled']], 3)
cluster_centers_m, distortion_m
# Assign cluster labels
'cluster_labels'], distortion_list_m = vq(mouse[['x_scaled', 'y_scaled']], cluster_centers_m)
mouse[
# Plot clusters
='x_scaled', y='y_scaled',
sns.scatterplot(x='cluster_labels', data = mouse)
hue plt.show()
kmeans is unable to capture the three visible clusters clearly, and the two clusters towards the top have taken in some points along the boundary. This happens because of the underlying assumption in the kmeans algorithm of minimizing distortion, which leads to clusters that are similar in terms of area.
FIFA 18: defenders revisited
In the FIFA 18 dataset, various attributes of players are present. Two such attributes are:
- defending: a number which signifies the defending attributes of a player
- physical: a number which signifies the physical attributes of a player
These are typically high in defense-minded players. We will perform clustering based on these attributes in the data.
= pd.read_csv("datasets/fifa2.csv")
fifa
# Set up a random seed in numpy
1000, 2000])
np.random.seed([
# Fit the data into a k-means algorithm
= kmeans(fifa[['scaled_def', 'scaled_phy']], 3)
cluster_centers,_ # Assign cluster labels
'cluster_labels'],_ = vq(fifa[['scaled_def', 'scaled_phy']], cluster_centers)
fifa[# Display cluster centers
'scaled_def', 'scaled_phy', 'cluster_labels']].groupby('cluster_labels').mean() fifa[[
cluster_labels | scaled_def | scaled_phy |
---|---|---|
0 | 3.743692 | 8.867419 |
1 | 1.865936 | 7.082691 |
2 | 2.096297 | 8.944870 |
# Create a scatter plot through seaborn
="scaled_def", y="scaled_phy", hue="cluster_labels", data=fifa)
sns.scatterplot(x plt.show()
Notice that the seed has an impact on clustering, as the data is uniformly distributed.
Clustering in Real World
We will apply our clustering knowledge to real-world problems. We will explore the process of finding dominant colors in an image before moving on to document clustering. We will conclude with a discussion on clustering with multiple variables, which makes it difficult to visualize all the data.
Dominant colors in images
- All images consist of pixels
- Each pixel has three values: Red, Green and Blue
- Pixel color: combination of these RGB values
- Perform k-means on standardized RGB values to find cluster centers
- Uses: Identifying features in satellite images
Tools to find dominant colors
- Convert image to pixels: matplotlib.image.imread
- Display colors of cluster centers: matplotlib.pyplot.imshow
= img.imread("datasets/sea.jpg")
image image.shape
(390, 632, 3)
image[0][:1]
array([[255, 255, 255]], dtype=uint8)
plt.imshow(image)
<matplotlib.image.AxesImage at 0x2088a4426a0>
r = []
g = []
b = []
for row in image:
    for pixel in row:
        temp_r, temp_g, temp_b = pixel
        r.append(temp_r)
        g.append(temp_g)
        b.append(temp_b)
Data frame with RGB values
pixels = pd.DataFrame({'red': r, 'green': g, 'blue': b})
pixels.head()
  | red | green | blue |
---|---|---|---|
0 | 255 | 255 | 255 |
1 | 255 | 255 | 255 |
2 | 255 | 255 | 255 |
3 | 255 | 255 | 255 |
4 | 255 | 255 | 255 |
Create an elbow plot
pixels[['scaled_red', 'scaled_blue', 'scaled_green']] = pixels[['red', 'blue', 'green']].apply(lambda x: x / np.std(x) * 255)
distortions_i = []
num_clusters_i = range(1, 7)

for i in num_clusters_i:
    cluster_centers_i, distortion_i = kmeans(pixels[['scaled_red', 'scaled_blue', 'scaled_green']], i)
    distortions_i.append(distortion_i)

elbow_plot_i = pd.DataFrame({'num_clusters': num_clusters_i, 'distortions': distortions_i})
_ = sns.lineplot(data=elbow_plot_i, x="num_clusters", y='distortions')
plt.xticks(num_clusters_i)
plt.show()
Extract RGB values from image
There are broadly three steps to find the dominant colors in an image:
- Extract RGB values into three lists.
- Perform k-means clustering on scaled RGB values.
- Display the colors of cluster centers.
= pd.read_csv("datasets/batman.csv")
batman_df batman_df.head()
  | red | blue | green | scaled_red | scaled_blue | scaled_green |
---|---|---|---|---|---|---|
0 | 10 | 15 | 9 | 0.134338 | 0.179734 | 0.126269 |
1 | 14 | 49 | 36 | 0.188074 | 0.587133 | 0.505076 |
2 | 55 | 125 | 103 | 0.738862 | 1.497787 | 1.445077 |
3 | 35 | 129 | 98 | 0.470185 | 1.545716 | 1.374928 |
4 | 38 | 134 | 101 | 0.510486 | 1.605628 | 1.417017 |
distortions = []
num_clusters = range(1, 7)

# Create a list of distortions from the kmeans function
for i in num_clusters:
    cluster_centers, distortion = kmeans(batman_df[['scaled_red', 'scaled_blue', 'scaled_green']], i)
    distortions.append(distortion)

# Create a data frame with two lists, num_clusters and distortions
elbow_plot = pd.DataFrame({'num_clusters': num_clusters, 'distortions': distortions})

# Create a line plot of num_clusters and distortions
sns.lineplot(x="num_clusters", y="distortions", data=elbow_plot)
plt.xticks(num_clusters)
plt.show()
There are three distinct colors present in the image, which is supported by the elbow plot.
Display dominant colors
To display the dominant colors, we will convert the colors of the cluster centers back to their raw values and then convert them to the range 0-1, using the following formula:
converted_pixel = standardized_pixel * pixel_std / 255
# Get standard deviations of each color
r_std, g_std, b_std = batman_df[['red', 'green', 'blue']].std()

# Re-run kmeans with the three clusters suggested by the elbow plot
cluster_centers, distortion = kmeans(batman_df[['scaled_red', 'scaled_blue', 'scaled_green']], 3)

colors = []
for cluster_center in cluster_centers:
    # Columns were passed to kmeans in the order red, blue, green
    scaled_r, scaled_b, scaled_g = cluster_center
    # Convert each standardized value to scaled value
    colors.append((
        scaled_r * r_std / 255,
        scaled_g * g_std / 255,
        scaled_b * b_std / 255
    ))

# Display colors of cluster centers
plt.imshow([colors])
plt.show()
Document clustering
concepts
- Clean data before processing
- Determine the importance of the terms in a document (in TF-IDF matrix)
- Cluster the TF-IDF matrix
- Find top terms, documents in each cluster
Clean and tokenize data
- Convert text into smaller parts called tokens, clean data for processing
TF-IDF (Term Frequency - Inverse Document Frequency)
- A weighted measure: evaluate how important a word is to a document in a collection
Clustering with sparse matrix
- kmeans() in SciPy does not support sparse matrices
- Use .todense() to convert to a matrix
cluster_centers, distortion = kmeans(tfidf_matrix.todense(), num_clusters)
Top terms per cluster
- Cluster centers: lists with a size equal to the number of terms
- Each value in the cluster center is its importance
- Create a dictionary and print top terms
More considerations
- Work with hyperlinks, emoticons etc.
- Normalize words (run, ran, running -> run)
- .todense() may not work with large datasets
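The course does not show code for normalizing word forms; as a rough sketch (the choice of NLTK's WordNetLemmatizer is an assumption, and it requires the wordnet corpus to be downloaded), inflections can be reduced to a base form before building the TF-IDF matrix:

from nltk.stem import WordNetLemmatizer   # may require: nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
# Treat the words as verbs so 'ran' and 'running' map back to 'run'
print([lemmatizer.lemmatize(w, pos='v') for w in ["run", "ran", "running"]])
# ['run', 'run', 'run']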
def remove_noise(text, stop_words=[]):
    tokens = word_tokenize(text)
    cleaned_tokens = []
    for token in tokens:
        token = re.sub('[^A-Za-z0-9]+', '', token)
        if len(token) > 1 and token.lower() not in stop_words:
            # Get lowercase
            cleaned_tokens.append(token.lower())
    return cleaned_tokens
"It is lovely weather we are having. I hope the weather continues.") remove_noise(
['it',
'is',
'lovely',
'weather',
'we',
'are',
'having',
'hope',
'the',
'weather',
'continues']
TF-IDF of movie plots
Let us use the plots of randomly selected movies to perform document clustering on. Before performing clustering on documents, they need to be cleaned of any unwanted noise (such as special characters and stop words) and converted into a sparse matrix through TF-IDF of the documents.
stop_words_2 = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'youre', 'youve', 'youll', 'youd', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'shes', 'her', 'hers', 'herself', 'it', 'its', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'thatll', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'dont', 'should', 'shouldve', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'arent', 'couldn', 'couldnt', 'didn', 'didnt', 'doesn', 'doesnt', 'hadn', 'hadnt', 'hasn', 'hasnt', 'haven', 'havent', 'isn', 'isnt', 'ma', 'mightn', 'mightnt', 'mustn', 'mustnt', 'needn', 'neednt', 'shan', 'shant', 'shouldn', 'shouldnt', 'wasn', 'wasnt', 'weren', 'werent', 'won', 'wont', 'wouldn', 'wouldnt']
"It is lovely weather we are having. I hope the weather continues.", stop_words=stop_words_2) remove_noise(
['lovely', 'weather', 'hope', 'weather', 'continues']
def remove_noise(text, stop_words=stop_words_2):
    tokens = word_tokenize(text)
    cleaned_tokens = []
    for token in tokens:
        token = re.sub('[^A-Za-z0-9]+', '', token)
        if len(token) > 1 and token.lower() not in stop_words:
            # Get lowercase
            cleaned_tokens.append(token.lower())
    return cleaned_tokens
= pd.read_csv("datasets/plots.csv")['0'].to_list()
plots 0][:10] plots[
'Cable Hogu'
# Initialize TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(min_df=.1, max_df=.75, max_features=50, tokenizer=remove_noise)

# Use the .fit_transform() method on the list plots
tfidf_matrix = tfidf_vectorizer.fit_transform(plots)
Top terms in movie clusters
num_clusters = 2

# Generate cluster centers through the kmeans function
cluster_centers, distortion = kmeans(tfidf_matrix.todense(), num_clusters)

# Generate terms from the tfidf_vectorizer object
terms = tfidf_vectorizer.get_feature_names()

for i in range(num_clusters):
    # Sort the terms and print top 3 terms
    center_terms = dict(zip(terms, cluster_centers[i]))
    sorted_terms = sorted(center_terms, key=center_terms.get, reverse=True)
    print(sorted_terms[:3])
['police', 'man', 'one']
['father', 'back', 'one']
Clustering with multiple features
Feature reduction
- Factor analysis
- Multidimensional scaling
Clustering with many features
Reduce features using a technique like factor analysis before clustering; a sketch of the steps is shown below.
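As a rough sketch (not part of the course; the use of scikit-learn's FactorAnalysis and the column names below are assumptions), several correlated attributes could be reduced to two latent factors and the factors clustered instead:

from sklearn.decomposition import FactorAnalysis
from scipy.cluster.vq import kmeans, vq

# Hypothetical set of scaled columns (they are created in the FIFA exercise further below)
cols = ['scaled_pac', 'scaled_sho', 'scaled_pas', 'scaled_dri', 'scaled_def', 'scaled_phy']

# Reduce six correlated attributes to two latent factors, then cluster the factors
factors = FactorAnalysis(n_components=2, random_state=0).fit_transform(fifa[cols])
centers, _ = kmeans(factors, 2)
labels, _ = vq(factors, centers)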
Basic checks on clusters
In the FIFA 18 dataset, we have concentrated on defenders in previous exercises. Let us now focus on the attacking attributes of a player. Pace (pac), Dribbling (dri) and Shooting (sho) are features that are prominent in attack-minded players.
= pd.read_csv("datasets/fifa3.csv")
fifa # Print the size of the clusters
"cluster_labels")['ID'].count() fifa.groupby(
cluster_labels
0 83
1 107
2 60
Name: ID, dtype: int64
# Print the mean value of wages in each cluster
"cluster_labels"])['eur_wage'].mean() fifa.groupby([
cluster_labels
0 132108.433735
1 130308.411215
2 117583.333333
Name: eur_wage, dtype: float64
The cluster sizes are not very different, and there are no significant differences in the wages. Further analysis is required to validate these clusters.
FIFA 18: what makes a complete player?
The overall level of a player in FIFA 18 is defined by six characteristics: pace (pac), shooting (sho), passing (pas), dribbling (dri), defending (def), physical (phy).
A sample player card in the game displays these six ratings.
features = ['pac', 'sho', 'pas', 'dri', 'def', 'phy']
scaled_features = ['scaled_pac',
                   'scaled_sho',
                   'scaled_pas',
                   'scaled_dri',
                   'scaled_def',
                   'scaled_phy']
fifa[scaled_features] = fifa[features].apply(lambda x: whiten(x))

# Create centroids with kmeans for 2 clusters
cluster_centers, _ = kmeans(fifa[scaled_features], 2)

# Assign cluster labels and print cluster centers
fifa['cluster_labels'], _ = vq(fifa[scaled_features], cluster_centers)
fifa.groupby("cluster_labels")[scaled_features].mean()
cluster_labels | scaled_pac | scaled_sho | scaled_pas | scaled_dri | scaled_def | scaled_phy |
---|---|---|---|---|---|---|
0 | 6.617743 | 3.885153 | 7.353643 | 7.148098 | 3.862353 | 9.009407 |
1 | 7.762181 | 5.610629 | 8.620873 | 8.968266 | 2.262328 | 8.009867 |
# Plot cluster centers to visualize clusters
fifa.groupby('cluster_labels')[scaled_features].mean().plot(legend=True, kind="bar")
plt.show()
# Get the name column of first 5 players in each cluster
for cluster in fifa['cluster_labels'].unique():
    print(cluster, fifa[fifa['cluster_labels'] == cluster]['name'].values[:5])
1 ['Cristiano Ronaldo' 'L. Messi' 'Neymar' 'L. Suárez' 'M. Neuer']
0 ['Sergio Ramos' 'G. Chiellini' 'L. Bonucci' 'J. Boateng' 'D. Godín']
The top players in each cluster are representative of the overall characteristics of the cluster: one cluster primarily represents attackers, whereas the other represents defenders. Surprisingly, a top goalkeeper, Manuel Neuer, appears in the attackers group, but he is known for coming out of his box and participating in open play, which is reflected in his FIFA 18 attributes.